1. Importing Libraries +Bring in Data
1.3 Description of the Time-Series Dataset
1.4 Description of the Qualitative Dataset
1.8 Outliers Detection and Treatment
2.2 Forecasting Target Variables
2.21 Target Variable- Total Inside Sales
2.22 Target Variable- Total Food Sales
2.23 Target Variable- Diesel Sales
3.1 Target Variable- Total Inside Sales
3.2 Target Variable- Total Food Sales
In the fast-paced world of retail, businesses often experience the challenge of balancing high growth ambitions with the need for accurate forecasts to make well-informed financial decisions. Maverik, a thriving retail brand, is no exception. With plans to open or build 30 new stores each year, Maverik's high-growth trajectory requires precise first-year sales forecasts. These forecasts serve as the cornerstone for building effective financial plans, optimizing resource allocation, and providing a benchmark to assess store performance against projected outcomes. The success of this endeavor hinges on two primary factors: achieving a Return on Investment (ROI) that closely matches forecasted ROI and consistently attaining accurate sales metric forecasts as new sales data becomes available.
To address this critical business challenge, Maverik has embarked on a data-driven journey, combining the power of qualitative and time series data to develop a sophisticated time series forecasting model in Python. This model leverages qualitative insights from recent new store openings, network-wide seasonality patterns, and a diverse set of sales metrics. It will employ a variety of machine learning algorithms to generate daily-level forecasts, enabling Maverik to make precise financial and operational decisions.
The scope of this project encompasses the end-to-end development of the time series forecasting model. It will incorporate qualitative data, network-wide seasonality patterns, and other relevant factors to deliver accurate forecasts for the first year of sales for new stores.
This series of modeling tasks will explore the utilization of machine learning techniques to create a forecasting model tailored to Maverik's unique needs. It aims to provide Maverik with the insights and tools necessary to maintain its impressive growth while making data-driven decisions that foster long-term success.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.subplots as sp
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from sklearn.model_selection import train_test_split
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import os
os.chdir('/content/drive/MyDrive/maverik-capstone') # changing the default directory
qualitative_data_path = 'qualitative_data_msba.csv'
df_qdm = pd.read_csv(qualitative_data_path, index_col=0)
time_series_data_path = 'time_series_data_msba.csv'
df_tsdm = pd.read_csv(time_series_data_path, index_col=0)
df_tsdm.head()
| capital_projects.soft_opening_date | calendar.calendar_day_date | calendar.fiscal_week_id_for_year | calendar.day_of_week | calendar_information.holiday | calendar_information.type_of_day | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 6/14/2022 | 6/17/2022 | 25 | Friday | NONE | WEEKDAY | 2168.2920 | 861.6930 | 722.7745 | 1425.1020 | 24535 |
| 2 | 6/14/2022 | 6/22/2022 | 25 | Wednesday | NONE | WEEKDAY | 2051.5635 | 808.0275 | 730.4850 | 1436.2740 | 24535 |
| 3 | 6/14/2022 | 6/23/2022 | 25 | Thursday | NONE | WEEKDAY | 2257.5000 | 966.4410 | 895.7970 | 1594.3725 | 24535 |
| 4 | 6/14/2022 | 6/26/2022 | 26 | Sunday | NONE | WEEKEND | 1520.5925 | 542.3250 | 584.2900 | 1124.9280 | 24535 |
| 5 | 6/14/2022 | 6/27/2022 | 26 | Monday | NONE | WEEKDAY | 1897.6930 | 771.4525 | 852.2605 | 1640.2540 | 24535 |
df_tsdm.columns
Index(['capital_projects.soft_opening_date', 'calendar.calendar_day_date',
'calendar.fiscal_week_id_for_year', 'calendar.day_of_week',
'calendar_information.holiday', 'calendar_information.type_of_day',
'daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service',
'diesel', 'unleaded', 'site_id_msba'],
dtype='object')
df_tsdm.shape
(13908, 11)
df_tsdm.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 13908 entries, 1 to 13908 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 capital_projects.soft_opening_date 13908 non-null object 1 calendar.calendar_day_date 13908 non-null object 2 calendar.fiscal_week_id_for_year 13908 non-null int64 3 calendar.day_of_week 13908 non-null object 4 calendar_information.holiday 13908 non-null object 5 calendar_information.type_of_day 13908 non-null object 6 daily_yoy_ndt.total_inside_sales 13908 non-null float64 7 daily_yoy_ndt.total_food_service 13908 non-null float64 8 diesel 13908 non-null float64 9 unleaded 13908 non-null float64 10 site_id_msba 13908 non-null int64 dtypes: float64(4), int64(2), object(5) memory usage: 1.3+ MB
df_tsdm.describe()
| calendar.fiscal_week_id_for_year | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel | unleaded | site_id_msba | |
|---|---|---|---|---|---|---|
| count | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 | 13908.000000 |
| mean | 26.501079 | 2846.537988 | 759.922326 | 1702.585227 | 2382.091588 | 23041.052632 |
| std | 14.998715 | 981.299870 | 341.578220 | 2161.208192 | 1025.518658 | 710.634218 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 240.180500 | 21560.000000 |
| 25% | 14.000000 | 2181.156250 | 521.087875 | 383.062750 | 1654.149000 | 22540.000000 |
| 50% | 26.000000 | 2693.976250 | 697.434500 | 1018.920000 | 2256.677500 | 22907.500000 |
| 75% | 39.000000 | 3325.306250 | 924.282625 | 2283.297625 | 2928.254000 | 23555.000000 |
| max | 52.000000 | 7172.466000 | 2531.662000 | 20853.952000 | 8077.233500 | 24535.000000 |
df_qdm.head()
| open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | lottery | freal | bonfire_grill | pizza | cinnabon | ... | rv_lanes_fueling_positions_2 | hi_flow_rv_lanes_layout | hi_flow_rv_lanes_stack_type | non_24_hour | self_check_out | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2021 | 5046 | 2 | 2 | 38 | Yes | Yes | Yes | No | No | ... | 6 | Stack | HF/RV | No | Yes | 2 | 2 | 6 | 2 | 21560 |
| 2 | 2021 | 5046 | 2 | 2 | 39 | No | Yes | Yes | Yes | No | ... | 4 | Combo | HF/RV | No | Yes | 5 | 5 | 10 | 4 | 21980 |
| 3 | 2021 | 5046 | 2 | 2 | 35 | Yes | Yes | Yes | Yes | No | ... | 5 | In-Line | None | No | Yes | 3 | 2 | 4 | 1 | 22015 |
| 4 | 2021 | 5046 | 2 | 2 | 36 | No | Yes | Yes | Yes | No | ... | 4 | Combo | HF/RV | No | Yes | 3 | 3 | 6 | 2 | 22085 |
| 5 | 2021 | 5046 | 2 | 2 | 25 | Yes | Yes | Yes | No | No | ... | 0 | NaN | NaN | No | Yes | 0 | 0 | 0 | 0 | 22120 |
5 rows × 54 columns
df_qdm.columns
Index(['open_year', 'square_feet', 'front_door_count',
'years_since_last_project', 'parking_spaces', 'lottery', 'freal',
'bonfire_grill', 'pizza', 'cinnabon', 'godfather_s_pizza',
'ethanol_free', 'diesel', 'hi_flow_lanes', 'rv_lanes',
'hi_flow_rv_lanes', 'def', 'cat_scales', 'car_wash', 'ev_charging',
'rv_dumps', 'propane', 'x1_mile_pop', 'x1_mile_emp', 'x1_mile_income',
'x1_2_mile_pop', 'x1_2_mile_emp', 'x1_2_mile_income', 'x5_min_pop',
'x5_min_emp', 'x5_min_inc', 'x7_min_pop', 'x7_min_emp', 'x7_min_inc',
'traditional_forecourt_fueling_positions',
'traditional_forecourt_layout', 'traditional_forecourt_stack_type',
'rv_lanes_fueling_positions', 'rv_lanes_layout', 'rv_lanes_stack_type',
'hi_flow_lanes_fueling_positions', 'hi_flow_lanes_layout',
'hi_flow_lanes_stack_type', 'hi_flow_lanes_fueling_positions_2',
'rv_lanes_fueling_positions_2', 'hi_flow_rv_lanes_layout',
'hi_flow_rv_lanes_stack_type', 'non_24_hour', 'self_check_out',
'mens_toilet_count', 'mens_urinal_count', 'womens_toilet_count',
'womens_sink_count', 'site_id_msba'],
dtype='object')
df_qdm.shape
(37, 54)
df_qdm.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 37 entries, 1 to 37 Data columns (total 54 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 open_year 37 non-null int64 1 square_feet 37 non-null int64 2 front_door_count 37 non-null int64 3 years_since_last_project 37 non-null int64 4 parking_spaces 37 non-null int64 5 lottery 37 non-null object 6 freal 37 non-null object 7 bonfire_grill 37 non-null object 8 pizza 37 non-null object 9 cinnabon 37 non-null object 10 godfather_s_pizza 37 non-null object 11 ethanol_free 37 non-null object 12 diesel 37 non-null object 13 hi_flow_lanes 37 non-null object 14 rv_lanes 37 non-null object 15 hi_flow_rv_lanes 37 non-null object 16 def 37 non-null object 17 cat_scales 37 non-null object 18 car_wash 37 non-null object 19 ev_charging 37 non-null object 20 rv_dumps 37 non-null object 21 propane 37 non-null object 22 x1_mile_pop 37 non-null int64 23 x1_mile_emp 37 non-null int64 24 x1_mile_income 37 non-null int64 25 x1_2_mile_pop 37 non-null int64 26 x1_2_mile_emp 37 non-null int64 27 x1_2_mile_income 37 non-null int64 28 x5_min_pop 37 non-null int64 29 x5_min_emp 37 non-null int64 30 x5_min_inc 37 non-null int64 31 x7_min_pop 37 non-null int64 32 x7_min_emp 37 non-null int64 33 x7_min_inc 37 non-null int64 34 traditional_forecourt_fueling_positions 37 non-null int64 35 traditional_forecourt_layout 37 non-null object 36 traditional_forecourt_stack_type 37 non-null object 37 rv_lanes_fueling_positions 37 non-null int64 38 rv_lanes_layout 23 non-null object 39 rv_lanes_stack_type 23 non-null object 40 hi_flow_lanes_fueling_positions 37 non-null int64 41 hi_flow_lanes_layout 22 non-null object 42 hi_flow_lanes_stack_type 22 non-null object 43 hi_flow_lanes_fueling_positions_2 37 non-null int64 44 rv_lanes_fueling_positions_2 37 non-null int64 45 hi_flow_rv_lanes_layout 23 non-null object 46 hi_flow_rv_lanes_stack_type 23 non-null object 47 non_24_hour 37 non-null object 48 self_check_out 37 non-null object 49 mens_toilet_count 37 non-null int64 50 mens_urinal_count 37 non-null int64 51 womens_toilet_count 37 non-null int64 52 womens_sink_count 37 non-null int64 53 site_id_msba 37 non-null int64 dtypes: int64(27), object(27) memory usage: 15.9+ KB
df_qdm.describe()
| open_year | square_feet | front_door_count | years_since_last_project | parking_spaces | x1_mile_pop | x1_mile_emp | x1_mile_income | x1_2_mile_pop | x1_2_mile_emp | ... | traditional_forecourt_fueling_positions | rv_lanes_fueling_positions | hi_flow_lanes_fueling_positions | hi_flow_lanes_fueling_positions_2 | rv_lanes_fueling_positions_2 | mens_toilet_count | mens_urinal_count | womens_toilet_count | womens_sink_count | site_id_msba | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 37.000000 | 37.00000 | 37.0 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | ... | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 | 37.000000 |
| mean | 2021.324324 | 4970.27027 | 2.0 | 1.648649 | 37.405405 | 6703.567568 | 4757.648649 | 53300.378378 | 1833.108108 | 1514.135135 | ... | 14.270270 | 2.513514 | 3.324324 | 3.324324 | 2.513514 | 2.378378 | 2.351351 | 4.648649 | 1.702703 | 23040.405405 |
| std | 0.474579 | 575.93121 | 0.0 | 0.483978 | 5.918237 | 5694.011350 | 4697.168291 | 24333.027254 | 1915.140476 | 2489.423094 | ... | 3.948619 | 2.049683 | 2.925501 | 2.925501 | 2.049683 | 0.923500 | 0.856875 | 1.751447 | 0.740303 | 730.069801 |
| min | 2021.000000 | 2933.00000 | 2.0 | 1.000000 | 23.000000 | 0.000000 | 56.000000 | 0.000000 | 0.000000 | 31.000000 | ... | 10.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 21560.000000 |
| 25% | 2021.000000 | 5046.00000 | 2.0 | 1.000000 | 34.000000 | 1984.000000 | 1771.000000 | 39538.000000 | 262.000000 | 386.000000 | ... | 12.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 2.000000 | 4.000000 | 1.000000 | 22540.000000 |
| 50% | 2021.000000 | 5046.00000 | 2.0 | 2.000000 | 38.000000 | 5574.000000 | 3895.000000 | 46356.000000 | 1003.000000 | 1037.000000 | ... | 12.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 2.000000 | 2.000000 | 4.000000 | 2.000000 | 22890.000000 |
| 75% | 2022.000000 | 5046.00000 | 2.0 | 2.000000 | 41.000000 | 11269.000000 | 6002.000000 | 73519.000000 | 2686.000000 | 1616.000000 | ... | 16.000000 | 4.000000 | 5.000000 | 5.000000 | 4.000000 | 3.000000 | 3.000000 | 6.000000 | 2.000000 | 23555.000000 |
| max | 2022.000000 | 6134.00000 | 2.0 | 2.000000 | 49.000000 | 18692.000000 | 26077.000000 | 110957.000000 | 5923.000000 | 15403.000000 | ... | 24.000000 | 6.000000 | 9.000000 | 9.000000 | 6.000000 | 5.000000 | 5.000000 | 10.000000 | 4.000000 | 24535.000000 |
8 rows × 27 columns
# Calculate the number of missing values for each column in the qualitative DataFrame
missing_qualitative = df_qdm.isnull().sum()
# Calculate the number of missing values for each column in the time series DataFrame
missing_time_series = df_tsdm.isnull().sum()
# Show columns from the qualitative DataFrame that have any missing values
missing_qualitative[missing_qualitative > 0]
rv_lanes_layout 14 rv_lanes_stack_type 14 hi_flow_lanes_layout 15 hi_flow_lanes_stack_type 15 hi_flow_rv_lanes_layout 14 hi_flow_rv_lanes_stack_type 14 dtype: int64
# Check unique values in columns with missing values in qualitative DataFrame
print('rv_lanes_layout Unique values: ', df_qdm['rv_lanes_layout'].unique())
print('rv_lanes_stack_type Unique values: ', df_qdm['rv_lanes_stack_type'].unique())
print('hi_flow_lanes_layout Unique values: ', df_qdm['hi_flow_lanes_layout'].unique())
print('hi_flow_lanes_stack_type Unique values: ', df_qdm['hi_flow_lanes_stack_type'].unique())
print('hi_flow_rv_lanes_layout Unique values: ', df_qdm['hi_flow_rv_lanes_layout'].unique())
print('hi_flow_rv_lanes_stack_type Unique values: ', df_qdm['hi_flow_rv_lanes_stack_type'].unique())
rv_lanes_layout Unique values: ['Stack' 'In-Line' nan] rv_lanes_stack_type Unique values: ['HF/RV' 'None' nan] hi_flow_lanes_layout Unique values: ['Stack' 'Combo' nan] hi_flow_lanes_stack_type Unique values: ['HF/RV' nan] hi_flow_rv_lanes_layout Unique values: ['Stack' 'Combo' 'In-Line' nan] hi_flow_rv_lanes_stack_type Unique values: ['HF/RV' 'None' nan]
# Fill all Missing values in the qualitative DataFrame with a new category 'None'
df_qdm = df_qdm.fillna('None')
# Show columns from the qualitative DataFrame that have any missing values
missing_time_series[missing_time_series > 0]
Series([], dtype: int64)
merged_df = pd.merge(df_tsdm, df_qdm, on='site_id_msba', how='inner')
# Convert date columns to datetime format
merged_df['calendar.calendar_day_date'] = pd.to_datetime(merged_df['calendar.calendar_day_date'])
merged_df['capital_projects.soft_opening_date'] = pd.to_datetime(merged_df['capital_projects.soft_opening_date'])
# Sort Values by site id and calendar day date
merged_df = merged_df.sort_values(['site_id_msba', 'calendar.calendar_day_date'])
# Create new fetaures based on the existing date features
merged_df['day_of_week'] = merged_df['calendar.calendar_day_date'].dt.dayofweek
merged_df['calendar_year'] = merged_df['calendar.calendar_day_date'].dt.year
merged_df['calendar_month'] = merged_df['calendar.calendar_day_date'].dt.month
merged_df['calendar_day'] = merged_df['calendar.calendar_day_date'].dt.day
# Function to detect outliers
def detect_outliers_iqr(data, feature):
Q1 = data[feature].quantile(0.25)
Q3 = data[feature].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)]
return outliers, lower_bound, upper_bound
target_features = ["daily_yoy_ndt.total_inside_sales", "daily_yoy_ndt.total_food_service", "diesel_x", "unleaded"]
outliers_info = {}
# Detecting outliers for each target feature
for feature in target_features:
outliers, lower_bound, upper_bound = detect_outliers_iqr(merged_df, feature)
outliers_info[feature] = {
"outliers_count": outliers.shape[0],
"lower_bound": lower_bound,
"upper_bound": upper_bound,
"outliers_index": outliers.index
}
# Outliers information for each target variable
outliers_summary = pd.DataFrame(outliers_info).T[['outliers_count', 'lower_bound', 'upper_bound']]
outliers_summary
| outliers_count | lower_bound | upper_bound | |
|---|---|---|---|
| daily_yoy_ndt.total_inside_sales | 490 | 451.423875 | 5068.637875 |
| daily_yoy_ndt.total_food_service | 517 | -84.042 | 1541.967 |
| diesel_x | 640 | -2480.291625 | 5198.477375 |
| unleaded | 326 | -304.937062 | 4904.032438 |
# Capping the outliers to the upper and lower bounds for each target feature
for feature in target_features:
lower_bound = outliers_info[feature]['lower_bound']
upper_bound = outliers_info[feature]['upper_bound']
merged_df[feature] = merged_df[feature].clip(lower=lower_bound, upper=upper_bound)
# Check the data to ensure the capping was applied
merged_df_describe = merged_df[target_features].describe()
merged_df_describe
| daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel_x | unleaded | |
|---|---|---|---|---|
| count | 13542.000000 | 13542.000000 | 13542.000000 | 13542.000000 |
| mean | 2831.571136 | 756.807500 | 1557.357496 | 2371.155283 |
| std | 924.481068 | 313.037819 | 1410.406522 | 978.457952 |
| min | 451.423875 | 0.000000 | 12.498500 | 240.180500 |
| 25% | 2182.879125 | 525.711375 | 399.246750 | 1648.426500 |
| 50% | 2694.347250 | 705.759250 | 1083.677000 | 2263.154250 |
| 75% | 3337.182625 | 932.213625 | 2318.939000 | 2950.668875 |
| max | 5068.637875 | 1541.967000 | 5198.477375 | 4904.032438 |
A considerable number of outliers were found for each of the target variables when the IQR method's 1.5 multiplier was used for outlier detection. This finding raises the possibility of either a significant variation in the data or a high sensitivity of this method to the features of the dataset. With a large range between the lower and upper bounds, the descriptive statistics performed prior to capping suggested that these outliers might have an impact on the overall profile of the dataset. Following capping, the dataset was shown to have less variability and possibly more stability for further analysis based on the lower standard deviation and the alignment of maximum values with the upper bounds. During the modeling process, we can determine if there is a significant effect of outliers on the residuals.
# Function to perform the Augmented Dickey-Fuller test
def adf_test(series, feature_name):
result = adfuller(series, autolag='AIC')
adf_statistic = result[0]
p_value = result[1]
critical_values = result[4]
return {
"Feature": feature_name,
"ADF Statistic": adf_statistic,
"p-value": p_value,
"Critical Values": critical_values,
"Stationarity": p_value < 0.05 # If p-value < 0.05, we reject the null hypothesis (series is stationary)
}
# Applying the ADF test on the target features
adf_results = pd.DataFrame([adf_test(merged_df[feature], feature) for feature in target_features])
adf_results[['Feature', 'ADF Statistic', 'p-value', 'Stationarity']]
| Feature | ADF Statistic | p-value | Stationarity | |
|---|---|---|---|---|
| 0 | daily_yoy_ndt.total_inside_sales | -5.972455 | 1.921862e-07 | True |
| 1 | daily_yoy_ndt.total_food_service | -4.959112 | 2.672214e-05 | True |
| 2 | diesel_x | -5.228578 | 7.684673e-06 | True |
| 3 | unleaded | -5.464509 | 2.476066e-06 | True |
All variables show evidence of stationarity, according to the results of the Augmented Dickey-Fuller(ADF) test applied to the target features, as each variable's test statistics are below the critical values and its p-value is significantly below the 0.05 threshold.
This implies that these variables time series data do not have unit roots and are stable over time, indicating that their statistical characteristics such as variance and mean are not time-dependent.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
# Funtion to plot Autocorrelation plots
def plot_auto_correlation_plots(target):
# Create subplots with 1 row and 2 columns
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 4))
# Plot Autocorrelation Function (ACF)
plot_acf(merged_df[target], ax=axes[0])
axes[0].set_title(f'Autocorrelation Function for {target}')
axes[0].set_xlabel('Lags') # Set x-axis label
axes[0].set_ylabel('Partial Autocorrelation')
# Plot Partial Autocorrelation Function (PACF)
plot_pacf(merged_df[target], ax=axes[1])
axes[1].set_title(f'Partial Autocorrelation Function for {target}')
axes[1].set_xlabel('Lags') # Set x-axis label
axes[1].set_ylabel('Partial Autocorrelation')
# Show the plots
plt.tight_layout()
plt.show()
# Iterate through the target features to plot the autocorrelation plots
target_features = ["daily_yoy_ndt.total_inside_sales", "daily_yoy_ndt.total_food_service", "diesel_x", "unleaded"]
for target in target_features:
plot_auto_correlation_plots(target)
The ACF plots for both "daily_yoy_ndt.total_inside_sales" and "daily_yoy_ndt.total_food_service" indicate a strong positive correlation at lag 1 that gradually declines, yet remains significant across many lags, suggesting non-stationarity and a potential trend or seasonal pattern in the data. The PACF plot for "total_inside_sales" shows a significant spike at lag 1 and then cuts off, whereas the "total_food_service" PACF plot suggests a less clear AR structure with significant lags occurring at various intervals. Both datasets are likely to benefit from differencing to remove the non-stationary components and may require seasonal adjustment given the repeating patterns in the ACF. The autocorrelation function (ACF) plots for both "diesel_x" and "unleaded" show a very gradual decline in the correlations as lags increase, which implies a strong persistent autocorrelation over time, indicative of a non-stationary time series.
# Create a list of categorical columns
categorical_columns = list(merged_df.select_dtypes(include=['object']).columns)
# One-Hot Encoding for categorical variables
one_hot_encoder = OneHotEncoder(sparse=False, drop='first')
encoded_categorical_columns = pd.DataFrame(
one_hot_encoder.fit_transform(merged_df[categorical_columns]),
index=merged_df.index
)
# Join the encoded columns back to the dataframe
merged_df = merged_df.join(encoded_categorical_columns).drop(columns=categorical_columns)
# Display head for encoded DataFrame
merged_df.head()
| capital_projects.soft_opening_date | calendar.calendar_day_date | calendar.fiscal_week_id_for_year | daily_yoy_ndt.total_inside_sales | daily_yoy_ndt.total_food_service | diesel_x | unleaded | site_id_msba | open_year | square_feet | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13447 | 2021-01-12 | 2021-01-12 | 2 | 2036.2685 | 762.8530 | 1424.1850 | 1522.0030 | 21560 | 2021 | 5046 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 13448 | 2021-01-12 | 2021-01-13 | 2 | 2379.8880 | 1003.7930 | 1303.8445 | 1853.7715 | 21560 | 2021 | 5046 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 13449 | 2021-01-12 | 2021-01-14 | 2 | 2435.8600 | 974.2250 | 1375.6785 | 2122.4070 | 21560 | 2021 | 5046 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 13359 | 2021-01-12 | 2021-01-15 | 3 | 2805.9780 | 911.0115 | 1334.9175 | 2267.9930 | 21560 | 2021 | 5046 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 13450 | 2021-01-12 | 2021-01-16 | 3 | 2314.7635 | 715.7535 | 831.1625 | 1819.6395 | 21560 | 2021 | 5046 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 96 columns
# Get a list of unique stores
unique_stores_list = merged_df['site_id_msba'].unique()
# Split the stores into training and testing sets
train_stores, test_stores = train_test_split(unique_stores_list, test_size=5, random_state=42) # Using a fixed random state for reproducibility
# Split the data into training and testing based on the stores
train_data_og = merged_df[merged_df['site_id_msba'].isin(train_stores)]
test_data_og = merged_df[merged_df['site_id_msba'].isin(test_stores)]
# Let's check the number of entries in the training and testing sets
len_train = len(train_data_og)
len_test = len(test_data_og)
(len_train, len_test)
(11712, 1830)
Lag and rolling features are commonly used in time-series analysis and forecasting. These features involve using historical data to generate new features that can provide insights into trends, patterns, or relationships within a dataset
def create_lag_features(target_variable):
# Creating lagged features for up to 7 days
train_data = train_data_og.copy()
test_data = test_data_og.copy()
for lag in range(1, 8):
train_data[f"{target_variable}_lag_{lag}"] = train_data[target_variable].shift(lag)
test_data[f"{target_variable}_lag_{lag}"] = test_data[target_variable].shift(lag)
# Creating rolling window features (7 days and 30 days)
train_data[f'{target_variable}_rolling_7_mean'] = train_data[target_variable].rolling(window=7).mean()
train_data[f'{target_variable}_rolling_30_mean'] = train_data[target_variable].rolling(window=30).mean()
train_data[f'{target_variable}_rolling_7_std'] = train_data[target_variable].rolling(window=7).std()
train_data[f'{target_variable}_rolling_30_std'] = train_data[target_variable].rolling(window=30).std()
test_data[f'{target_variable}_rolling_7_mean'] = test_data[target_variable].rolling(window=7).mean()
test_data[f'{target_variable}_rolling_30_mean'] = test_data[target_variable].rolling(window=30).mean()
test_data[f'{target_variable}_rolling_7_std'] = test_data[target_variable].rolling(window=7).std()
test_data[f'{target_variable}_rolling_30_std'] = test_data[target_variable].rolling(window=30).std()
# The first few rows will have NaN values due to lagging and rolling, so we drop them.
train_data.dropna(inplace=True)
test_data.dropna(inplace=True)
return train_data, test_data
pd.set_option('display.max_colwidth', None)
The create_lag_features function generates lag and rolling features for a given target variable in time-series data. It starts by creating lagged features for the target variable spanning up to 7 days, both for the training and test datasets. Following that, rolling window features are computed, including the 7-day and 30-day rolling mean and standard deviation. This process enhances the dataset by capturing temporal patterns and trends within the time series. However, as lag and rolling operations introduce NaN values in the initial rows, the function concludes by removing these rows from both the training and test datasets. Overall, this function is designed to prepare the data for time-series analysis or forecasting by incorporating valuable historical information through lag and rolling window features.
In our quest for forecasting sales of various stores, we systematically evaluated a diverse array of candidates, encompassing XGBoost, Linear Regression, Random Forest, SARIMA (Seasonal AutoRegressive Integrated Moving Average), and Prophet. Each of these models was chosen for its unique strengths and capabilities, contributing to a comprehensive exploration of potential solutions for our predictive analytics task.
XGBoost, a robust and flexible ensemble learning algorithm, was considered for its ability to handle complex relationships within the data and deliver accurate predictions. Linear Regression provided a fundamental baseline model, aiding in the interpretation of linear relationships within our dataset. Random Forest, known for its ensemble of decision trees, offered versatility and resilience to outliers, further enriching our model portfolio. SARIMA, as a dedicated time series model, was incorporated to capture potential temporal patterns and dependencies present in the data. Prophet, a specialized forecasting tool developed by Facebook, was included for its adeptness in handling time series data with strong seasonal components.
The final selection of the best model was based on a holistic evaluation of cross-validated performance metrics. Criteria such as MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and R2 (R-squared) were considered to identify the model that demonstrated superior predictive capabilities on both training and validation sets. The chosen model represents a balanced solution, leveraging the strengths of each candidate model within the context of our specific predictive analytics objectives.
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# Function to train XGBoost model and make predictions
def train_and_forecast(train_data, test_data, target_variable):
# Prepare features and target
X_train = train_data.drop(columns=['daily_yoy_ndt.total_inside_sales',
'daily_yoy_ndt.total_food_service',
'diesel_x',
'unleaded',
'site_id_msba',
'calendar.calendar_day_date',
'capital_projects.soft_opening_date'],axis=1)
y_train = train_data[target_variable]
X_test = test_data.drop(columns=['daily_yoy_ndt.total_inside_sales',
'daily_yoy_ndt.total_food_service',
'diesel_x',
'unleaded',
'site_id_msba',
'calendar.calendar_day_date',
'capital_projects.soft_opening_date'],axis=1)
y_test = test_data[target_variable]
# Initialize the XGBoost regressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
# Train the model using the training data
model.fit(X_train, y_train)
# Predict on the training data and the test data
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
# Calculate the performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, train_predictions))
test_rmse = np.sqrt(mean_squared_error(y_test, test_predictions))
train_r2 = r2_score(y_train, train_predictions)
test_r2 = r2_score(y_test, test_predictions)
# Create a DataFrame for actual vs predicted values for the test set
predictions_df = pd.DataFrame({
'Actual': y_test,
'Predicted': test_predictions
})
return model, predictions_df, (train_rmse, test_rmse, train_r2, test_r2)
# Forecast function to create additional features for the future dates and make predictions
def forecast_future(model, data, target_variable, periods=365):
# Forecast for each store
for store_id in data['site_id_msba'].unique():
store_data = data[data['site_id_msba'] == store_id].copy()
actual_dates = store_data['calendar.calendar_day_date']
actual_store_data = store_data.copy()
store_data.set_index('calendar.calendar_day_date', inplace=True, drop = False)
last_date = store_data['calendar.calendar_day_date'].max()
future_dates = pd.date_range(start=last_date, periods=periods + 1, closed='right')
future_data = pd.DataFrame(index=future_dates)
forecasts = pd.DataFrame(index=future_dates)
for date in future_dates:
if date not in store_data.index:
store_data = store_data.append(pd.Series(name=date))
# Creating features for the date
store_data['day_of_week'] = date.dayofweek
store_data['calendar_year'] = date.year
store_data['calendar_month'] = date.month
store_data['calendar_day'] = date.day
# Create lag and rolling features based on previous data
for lag in range(1, 8):
store_data[f"{target_variable}_lag_{lag}"] = store_data[target_variable].shift(lag)
store_data[f'{target_variable}_rolling_7_mean'] = store_data[target_variable].rolling(window=7).mean()
store_data[f'{target_variable}_rolling_30_mean'] = store_data[target_variable].rolling(window=30).mean()
store_data[f'{target_variable}_rolling_7_std'] = store_data[target_variable].rolling(window=7).std()
store_data[f'{target_variable}_rolling_30_std'] = store_data[target_variable].rolling(window=30).std()
# Drop the rows with NaN values created by shifting and rolling
if len(store_data.dropna()) >= 7:
store_data.dropna(inplace=True)
if not store_data.empty:
# Prepare the last row (most recent data) as input for prediction
X_future = store_data.drop(columns=['daily_yoy_ndt.total_inside_sales',
'daily_yoy_ndt.total_food_service',
'diesel_x',
'unleaded',
'site_id_msba',
'calendar.calendar_day_date',
'capital_projects.soft_opening_date'], axis=1).iloc[-1].to_frame().transpose()
# Predict the target variable for the current date
prediction = model.predict(X_future)[0]
forecasts.loc[date, store_id] = prediction
# Update the target variable with the prediction for the next prediction step
store_data.loc[date, target_variable] = prediction
forecasts['calendar.calendar_day_date'] = forecasts.index
dates = actual_store_data['calendar.calendar_day_date'].append(forecasts['calendar.calendar_day_date'] )
values = pd.concat([actual_store_data[target_variable], forecasts[store_id]])
plt.figure(figsize=(14, 7))
# Plot actual past data
plt.plot(actual_store_data['calendar.calendar_day_date'], actual_store_data[target_variable], label='Actual', color='blue')
# Find the forecast start date index for the plotting
actual_store_data.set_index('calendar.calendar_day_date', inplace=True)
forecast_start_index = actual_store_data.index.get_loc(last_date) + 1
# Plot future forecasts
plt.plot(dates[forecast_start_index:], values[forecast_start_index:], label='Forecast', color='orange', linestyle='--')
plt.title(f'{target_variable} Forecast for Store {store_id}')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()
forecasts.to_csv(f'/content/{store_id}_out.csv')
return forecasts
The above script leverages the XGBoost machine learning model for time-series forecasting. The train_and_forecast function first trains the XGBoost model using historical training data, evaluating its performance on both the training and test sets. This function returns the trained model, a DataFrame comparing actual and predicted values, and key performance metrics such as root mean squared error (RMSE) and R-squared (R2).
The second function, forecast_future, utilizes the trained model to generate forecasts for future dates. It iterates over each store, creating features for future dates based on lag and rolling window operations. The predicted values are then incorporated into the dataset for subsequent predictions. The script concludes by visualizing the forecasted sales alongside actual data for each store, saving the forecasts to CSV files, and providing insights into future sales trends for decision-making purposes.
# Train the model and forecast for the "total_inside_sales" target variable
target_variable = "daily_yoy_ndt.total_inside_sales"
train_data_lagged, test_data_lagged = create_lag_features(target_variable)
model, predictions_df, metrics = train_and_forecast(train_data_lagged, test_data_lagged, target_variable)
# Forecast the future for each store
forecasts = forecast_future(model, test_data_lagged, target_variable)
# Output the performance metrics and a snippet of the forecasts with metric names
train_rmse, test_rmse, train_r2, test_r2 = metrics
print(f"Train Root Mean Squared Error (RMSE): {train_rmse}")
print(f"Test Root Mean Squared Error (RMSE): {test_rmse}")
print(f"Train R-squared (R2): {train_r2}")
print(f"Test R-squared (R2): {test_r2}")
forecasts.head()
Train Root Mean Squared Error (RMSE): 159.0069894762768 Test Root Mean Squared Error (RMSE): 275.9534972545057 Train R-squared (R2): 0.9713151651872407 Test R-squared (R2): 0.8807128859604085
| 24255 | calendar.calendar_day_date | |
|---|---|---|
| 2023-08-10 | 2248.239990 | 2023-08-10 |
| 2023-08-11 | 2316.237793 | 2023-08-11 |
| 2023-08-12 | 2121.083496 | 2023-08-12 |
| 2023-08-13 | 1924.720337 | 2023-08-13 |
| 2023-08-14 | 2329.705322 | 2023-08-14 |
The forecasting model employed for predicting 'daily_yoy_ndt.total_inside_sales' across various stores showcases a commendable R2 score of 0.88, signifying that the model is capable of explaining a significant portion of the variance in the sales data. With a RMSE of 275.95, the predictions are in a reasonable range from the actual sales figures, although the train R2 and RMSE are higher than the test, possibly due to outliers, overfitting or unaccounted factors in the sales trends. Looking at the Forecast plots the model has difficulty capturing extreme values or spikes in sales, which could be due to many factors, including possible outliers, non-linear trends, or irregular patterns in the sales data that the model cannot account for.
# Train the model and forecast for the "total_food_service" target variable
target_variable = "daily_yoy_ndt.total_food_service"
train_data_lagged, test_data_lagged = create_lag_features(target_variable)
model, predictions_df, metrics = train_and_forecast(train_data_lagged, test_data_lagged, target_variable)
# Forecast the future for each store
forecasts = forecast_future(model, test_data_lagged, target_variable)
# Output the performance metrics and a snippet of the forecasts with metric names
train_rmse, test_rmse, train_r2, test_r2 = metrics
print(f"Train Root Mean Squared Error (RMSE): {train_rmse}")
print(f"Test Root Mean Squared Error (RMSE): {test_rmse}")
print(f"Train R-squared (R2): {train_r2}")
print(f"Test R-squared (R2): {test_r2}")
forecasts.head()
Train Root Mean Squared Error (RMSE): 51.945964715763616 Test Root Mean Squared Error (RMSE): 99.59938904171811 Train R-squared (R2): 0.9744054859512813 Test R-squared (R2): 0.7895828421958543
| 24255 | calendar.calendar_day_date | |
|---|---|---|
| 2023-08-10 | 793.250610 | 2023-08-10 |
| 2023-08-11 | 795.345215 | 2023-08-11 |
| 2023-08-12 | 662.977905 | 2023-08-12 |
| 2023-08-13 | 615.353699 | 2023-08-13 |
| 2023-08-14 | 770.036804 | 2023-08-14 |
The performance metrics from the forecasting model for 'daily_yoy_ndt.total_food_service' suggest a model with reasonable predictive accuracy, as evidenced by an R2 score of 0.78, indicating that the model accounts for a substantial amount of variability in the food service sales data. The RMSE of 99.59 points to there is possibility of the model overfitting, given that the R2 score is very high in the train data and drops significantly in the test data. The plots reinforce that the model is not capturing the full range of variability in the sales data, particularly the peaks and valleys.
# Train the model and forecast for the "diesel" target variable
target_variable = "diesel_x"
train_data_lagged, test_data_lagged = create_lag_features(target_variable)
model, predictions_df, metrics = train_and_forecast(train_data_lagged, test_data_lagged, target_variable)
# Forecast the future for each store
forecasts = forecast_future(model, test_data_lagged, target_variable)
# Output the performance metrics and a snippet of the forecasts with metric names
train_rmse, test_rmse, train_r2, test_r2 = metrics
print(f"Train Root Mean Squared Error (RMSE): {train_rmse}")
print(f"Test Root Mean Squared Error (RMSE): {test_rmse}")
print(f"Train R-squared (R2): {train_r2}")
print(f"Test R-squared (R2): {test_r2}")
forecasts.head()
Train Root Mean Squared Error (RMSE): 156.02656293069364 Test Root Mean Squared Error (RMSE): 1381.8849593065484 Train R-squared (R2): 0.9869627683251245 Test R-squared (R2): 0.246121573472739
| 24255 | calendar.calendar_day_date | |
|---|---|---|
| 2023-08-10 | 3157.135010 | 2023-08-10 |
| 2023-08-11 | 2913.590820 | 2023-08-11 |
| 2023-08-12 | 2451.014404 | 2023-08-12 |
| 2023-08-13 | 2374.418457 | 2023-08-13 |
| 2023-08-14 | 3065.634277 | 2023-08-14 |
The forecasting model for 'diesel_x' displays a poor R2 score of 0.24, indicating that the model's ability to predict diesel sales is weaker compared to the previous two targets. The RMSE is relatively high at 1381.88, suggesting that the model's sales predictions deviate substantially from the actual figures for the stores. The vast difference in RMSE among the stores indicates that the model struggles with consistency across different locations, likely due to location-specific factors that affect diesel sales and are not captured by the model. The forecast seems to predict constant sales, which does not reflect the actual sales trend. The model is likely overfitting to the training data and not capturing the underlying sales patterns effectively. This is evident from the high RMSE and low R2 on the test set, along with the discrepancy between the actual and forecasted sales in the plots.
# Train the model and forecast for the "unleaded" target variable
target_variable = "unleaded"
train_data_lagged, test_data_lagged = create_lag_features(target_variable)
model, predictions_df, metrics = train_and_forecast(train_data_lagged, test_data_lagged, target_variable)
# Forecast the future for each store
forecasts = forecast_future(model, test_data_lagged, target_variable)
# Output the performance metrics and a snippet of the forecasts with metric names
train_rmse, test_rmse, train_r2, test_r2 = metrics
print(f"Train Root Mean Squared Error (RMSE): {train_rmse}")
print(f"Test Root Mean Squared Error (RMSE): {test_rmse}")
print(f"Train R-squared (R2): {train_r2}")
print(f"Test R-squared (R2): {test_r2}")
forecasts.head()
Train Root Mean Squared Error (RMSE): 165.74368076600695 Test Root Mean Squared Error (RMSE): 313.03822722415134 Train R-squared (R2): 0.9722252196888015 Test R-squared (R2): 0.8697455429007126
| 24255 | calendar.calendar_day_date | |
|---|---|---|
| 2023-08-10 | 1625.639526 | 2023-08-10 |
| 2023-08-11 | 1654.871582 | 2023-08-11 |
| 2023-08-12 | 1370.459106 | 2023-08-12 |
| 2023-08-13 | 1340.163086 | 2023-08-13 |
| 2023-08-14 | 1605.919678 | 2023-08-14 |
The forecasting model's performance on 'unleaded' fuel sales reveals a mixed picture with an R2 score of 0.97, suggesting a fairly good fit where the model is able to explain a majority of the variance in sales. However, the RMSE of 313.53 points to considerable prediction errors across the dataset. The model still fails to capture the high variability and the spikes in the sales data which points to overfitting.
# OG Define a function to prepare the dataset for training
def prepare_training_data(data, store_ids, target_variable):
# Filter the data for the specified stores
data_filtered = data[data['site_id_msba'].isin(store_ids)]
# Keep only the first year of data for each store
data_first_year = data_filtered[data_filtered['calendar.calendar_day_date'] < data_filtered['capital_projects.soft_opening_date'] + pd.DateOffset(years=1)]
# Drop the target variable and other non-feature columns
X = data_first_year.drop(['daily_yoy_ndt.total_inside_sales',
'daily_yoy_ndt.total_food_service',
'diesel_x',
'unleaded',
'site_id_msba',
'calendar.calendar_day_date',
'capital_projects.soft_opening_date'], axis=1)
# Target variable
y = data_first_year[target_variable]
# One-hot encode the categorical variables
X = pd.get_dummies(X, drop_first=True)
return X, y
# This function will prepare data for predictions in a similar way as the training data
def prepare_predict_data(data, store_id, target_variable):
# Filter data for the specific store
store_data = data[data['site_id_msba'] == store_id]
# Ensure we use only the first year of data for prediction
store_data_first_year = store_data[store_data['calendar.calendar_day_date'] < store_data['capital_projects.soft_opening_date'] + pd.DateOffset(years=1)]
# Prepare features
X = store_data_first_year.drop(['daily_yoy_ndt.total_inside_sales',
'daily_yoy_ndt.total_food_service',
'diesel_x',
'unleaded',
'site_id_msba',
'calendar.calendar_day_date',
'capital_projects.soft_opening_date'], axis=1)
# One-hot encode the categorical variables
X = pd.get_dummies(X, drop_first=True)
return X, store_data_first_year[target_variable], store_data_first_year['calendar.calendar_day_date']
def forecast_model(target_variable):
train_data, test_data = create_lag_features(target_variable)
# Prepare the training data
X_train, y_train = prepare_training_data(train_data, train_stores, target_variable)
# Initialize the XGBRegressor
if target_variable in ['diesel_x', 'unleaded']:
model = XGBRegressor(n_estimators=1000, learning_rate=0.1, random_state=42, max_depth=3, subsample=1, colsample_bytree=1)
else:
model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
# Train the model on the combined dataset
model.fit(X_train, y_train)
# Dictionary to store the evaluation metrics for each store
store_metrics = {}
# Predict and evaluate for each of the prediction stores
for store_id in test_stores:
X_predict, y_true, dates = prepare_predict_data(test_data, store_id, target_variable)
y_pred = model.predict(X_predict)
mse = mean_squared_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)
r2 = r2_score(y_true, y_pred)
store_metrics[store_id] = {'MSE': mse, 'RMSE': rmse, 'R2':r2}
# Line plot to visualize the actual sales vs predicted sales
plt.figure(figsize=(10, 4))
plt.plot(dates, y_true, label='Actual')
plt.plot(dates, y_pred, label='Predicted')
plt.title(f'Actual vs Predicted Sales for {target_variable}')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
store_metrics_df = pd.DataFrame(store_metrics).T
store_metrics_df.reset_index(inplace=True)
store_metrics_df.rename(columns={'index': 'Store'}, inplace=True)
# Print the Evaluation Metrics DataFrame
print(store_metrics_df)
# Visualizing the Evaluation metrics
store_metrics_df['MSE'].plot(kind='bar', color='skyblue')
plt.title('MSE for Each Store')
plt.ylabel('MSE')
plt.xlabel('Store')
plt.show()
store_metrics_df['RMSE'].plot(kind='bar', color='lightgreen')
plt.title('RMSE for Each Store')
plt.ylabel('RMSE')
plt.xlabel('Store')
plt.show()
store_metrics_df['R2'].plot(kind='bar', color='salmon')
plt.title('R2 for Each Store')
plt.ylabel('R2')
plt.xlabel('Store')
plt.show()
# Calulating average of the evaluation metrics of each predicted stores
r2_sum = 0
mse_sum = 0
rmse_sum = 0
for k, v in store_metrics.items():
r2_sum+=v['R2']
mse_sum+=v['MSE']
rmse_sum+=v['RMSE']
avg_r2 = r2_sum/len(store_metrics)
avg_mse = mse_sum/len(store_metrics)
avg_rmse = rmse_sum/len(store_metrics)
print(" ")
print(f'Average MSE:{avg_mse}, Average RMSE:{avg_rmse}, Average R2: {avg_r2}')
# 'daily_yoy_ndt.total_inside_sales', 'daily_yoy_ndt.total_food_service', 'diesel_x', 'unleaded',
forecast_model('daily_yoy_ndt.total_inside_sales')
Store MSE RMSE R2 0 22855 55866.289154 236.360507 0.844885 1 22715 75070.442966 273.989859 0.853739 2 22120 33844.969969 183.970025 0.871654 3 23730 49058.454730 221.491433 0.917951 4 24255 122291.776472 349.702411 0.770049
Average MSE:67226.38665832648, Average RMSE:253.10284691658526, Average R2: 0.8516555612832315
The forecasting model employed for predicting 'daily_yoy_ndt.total_inside_sales' across various stores showcases a commendable average R2 score of 0.852, signifying that the model is capable of explaining a significant portion of the variance in the sales data. With an average RMSE of 253.41, the predictions are in a reasonable range from the actual sales figures, although the high MSE values suggest that there's notable variability in the predictions, possibly due to outliers or unaccounted factors in the sales trends.
forecast_model('daily_yoy_ndt.total_food_service')
Store MSE RMSE R2 0 22855 3747.248983 61.214777 0.793677 1 22715 20210.060599 142.162093 0.685973 2 22120 4937.896722 70.270170 0.807351 3 23730 2914.887061 53.989694 0.918709 4 24255 10655.491057 103.225438 0.737013
Average MSE:8493.116884452824, Average RMSE:86.17243434970929, Average R2: 0.7885444390287065
The performance metrics from the forecasting model for 'daily_yoy_ndt.total_food_service' suggest a model with reasonable predictive accuracy, as evidenced by an average R2 score of 0.811, indicating that the model accounts for a substantial amount of variability in the food service sales data. The average RMSE of 78.69 points to there is possibility of the model overfitting.
forecast_model('diesel_x')
Store MSE RMSE R2 0 22855 1.792002e+04 133.865667 0.610316 1 22715 5.897709e+06 2428.519966 -1.674923 2 22120 9.191024e+03 95.869828 0.540061 3 23730 3.412089e+04 184.718407 0.920933 4 24255 1.718907e+05 414.597082 0.824377
Average MSE:1226166.3789736975, Average RMSE:651.5141900365818, Average R2: 0.24415294415238337
The forecasting model for 'diesel_x' displays a moderate average R2 score of 0.517, indicating that the model's ability to predict diesel sales is relatively weaker compared to the previous two targets. The average RMSE is relatively high at 540.92, suggesting that the model's sales predictions deviate substantially from the actual figures for some stores. This is corroborated by the wide variance in performance metrics across different stores, with Store 23730 showing exemplary predictive accuracy as indicated by a high R2 score of 0.916 and the lowest RMSE, whereas Store 22715 exhibits a significantly poorer fit, reflected in a low R2 score and a very high RMSE. The vast difference in MSE among the stores indicates that the model struggles with consistency across different locations, likely due to location-specific factors that affect diesel sales and are not captured by the model.
forecast_model('unleaded')
Store MSE RMSE R2 0 22855 45369.644576 213.001513 0.834743 1 22715 108351.982438 329.168623 0.556015 2 22120 708046.756729 841.455142 -0.975716 3 23730 25596.662752 159.989571 0.796447 4 24255 108734.336436 329.748899 0.676561
Average MSE:199219.87658610358, Average RMSE:374.67274969008656, Average R2: 0.3776100169391928
The forecasting model's performance on 'unleaded' fuel sales reveals a mixed picture with an average R2 score of 0.730, suggesting a fairly good fit where the model is able to explain a majority of the variance in sales. However, the average RMSE of 258.53 points to considerable prediction errors across the dataset. The MSE and RMSE are notably high for Store 22715, indicating substantial deviations from actual sales and the lowest R2 score, which suggests that the model's predictions are less reliable for this store.
Cross-validation is crucial for assessing the performance of machine learning models like XGBoost, particularly in time series forecasting tasks. The below logic implements time series cross-validation, which is essential for evaluating the model's robustness and generalization ability. In time series data, observations are often temporally correlated, and using conventional cross-validation methods may lead to data leakage and overestimation of the model's performance.
The forecast_model_cv function addresses this issue by employing the TimeSeriesSplit strategy, which ensures that each training set is a proper subset of the corresponding validation set in terms of time. This prevents the model from seeing future information during training, providing a more realistic evaluation. The calculated metrics, including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2), offer insights into how well the XGBoost model performs across different time periods and individual stores, aiding in the identification of potential overfitting or underfitting issues and guiding hyperparameter tuning for optimal forecasting results.
Here we are performing the 5-fold Cross-Validation
def prepare_data(data, indices, target_variable):
# Filter the data for the specified indices
data_filtered = data.iloc[indices]
data_first_year = data_filtered[data_filtered['calendar.calendar_day_date'] < data_filtered['capital_projects.soft_opening_date'] + pd.DateOffset(years=1)]
# Prepare features and target
X = data_first_year.drop([
'daily_yoy_ndt.total_inside_sales',
'daily_yoy_ndt.total_food_service',
'diesel_x',
'unleaded',
'site_id_msba',
'calendar.calendar_day_date',
'capital_projects.soft_opening_date'
], axis=1)
y = data_first_year[target_variable]
# One-hot encode the categorical variables
X = pd.get_dummies(X, drop_first=True)
return X, y
def forecast_model_cv(target_variable, n_splits=5):
# Create time series cross-validator object
train_data, test_data = create_lag_features(target_variable)
tscv = TimeSeriesSplit(n_splits=n_splits)
# Initialize the XGBRegressor
if target_variable in ['diesel_x']:
model = XGBRegressor(n_estimators=1000, learning_rate=0.1, random_state=42, max_depth=3, subsample=1, colsample_bytree=1)
else:
model = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
# Dictionary to store the evaluation metrics for each fold
fold_metrics = {}
# Perform time series cross-validation
for fold, (train_index, test_index) in enumerate(tscv.split(train_data)):
# Prepare the training and testing sets
X_train, y_train = prepare_data(train_data, train_index, target_variable)
X_test, y_test = prepare_data(train_data, test_index, target_variable)
# Train the model on the current training set
model.fit(X_train, y_train)
# Predict on the current testing set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2 = r2_score(y_test, y_pred)
# Store the metrics for the current fold
fold_metrics[fold] = {'MSE': mse, 'RMSE': rmse, 'R2': r2}
# Print the metrics for each fold
for fold in fold_metrics:
print(f"Fold {fold}: MSE = {fold_metrics[fold]['MSE']}, RMSE = {fold_metrics[fold]['RMSE']}, R2 = {fold_metrics[fold]['R2']}")
# Calculate the average performance across all folds
avg_mse = sum(fm['MSE'] for fm in fold_metrics.values()) / n_splits
avg_rmse = sum(fm['RMSE'] for fm in fold_metrics.values()) / n_splits
avg_r2 = sum(fm['R2'] for fm in fold_metrics.values()) / n_splits
print(f"Average MSE: {avg_mse}, Average RMSE: {avg_rmse}, Average R2: {avg_r2}")
store_metrics = {}
# Predict and evaluate for each of the prediction stores
for store_id in test_stores:
X_predict, y_true, dates = prepare_predict_data(test_data, store_id, target_variable)
y_pred = model.predict(X_predict)
mse = mean_squared_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)
r2 = r2_score(y_true, y_pred)
store_metrics[store_id] = {'MSE': mse, 'RMSE': rmse, 'R2':r2}
store_metrics_df = pd.DataFrame(store_metrics).T # Transpose to have stores as rows
store_metrics_df.reset_index(inplace=True)
store_metrics_df.rename(columns={'index': 'Store'}, inplace=True)
r2_sum = 0
mse_sum = 0
rmse_sum = 0
for k, v in store_metrics.items():
r2_sum+=v['R2']
mse_sum+=v['MSE']
rmse_sum+=v['RMSE']
avg_r2 = r2_sum/len(store_metrics)
avg_mse = mse_sum/len(store_metrics)
avg_rmse = rmse_sum/len(store_metrics)
print("--------Validation set results---------")
print(f'Average MSE:{avg_mse}, Average RMSE:{avg_rmse}, Average R2: {avg_r2}')
forecast_model_cv('daily_yoy_ndt.total_inside_sales')
Fold 0: MSE = 133129.85295525612, RMSE = 364.86963830285487, R2 = 0.6779509115851128 Fold 1: MSE = 881600.4979655384, RMSE = 938.9358327199673, R2 = 0.08185414340202524 Fold 2: MSE = 142604.70716958528, RMSE = 377.630384330479, R2 = 0.8184475550749276 Fold 3: MSE = 82640.91154668745, RMSE = 287.4733231913658, R2 = 0.9177036043351303 Fold 4: MSE = 61130.80111130455, RMSE = 247.24643801540307, R2 = 0.8684970950457372 Average MSE: 260221.35414967436, Average RMSE: 443.231123312014, Average R2: 0.6728906618885866 --------Validation set results--------- Average MSE:64620.72830267446, Average RMSE:248.6110640638843, Average R2: 0.8573949742972411
The above results & logic represents a time series cross-validation process for the XGBoost model applied to the target variable total_inside_sales.' The results for each fold of the cross-validation are displayed, showing the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) metrics.
The average performance across all folds is also calculated. The average MSE is 260,221.35, the average RMSE is 443.23, and the average R2 is 0.67, indicating the model's performance across different time periods.
The 'Validation set results' section presents the average metrics for individual stores, emphasizing the model's ability to generalize to new data. In this case, the average MSE is 64,620.73, the average RMSE is 248.61, and the average R2 is 0.86. These metrics provide insights into the XGBoost model's effectiveness in forecasting the 'total_inside_sales' variable, with lower MSE and RMSE values and a higher R2 indicating better predictive performance.
forecast_model_cv('daily_yoy_ndt.total_food_service')
Fold 0: MSE = 9017.013041198976, RMSE = 94.95795407020402, R2 = 0.8303379614286998 Fold 1: MSE = 92707.64464306035, RMSE = 304.4793008450006, R2 = -0.022260001848123823 Fold 2: MSE = 10465.695345224807, RMSE = 102.30198114027317, R2 = 0.8977615373006135 Fold 3: MSE = 7223.355237594567, RMSE = 84.99032437633456, R2 = 0.938783156739038 Fold 4: MSE = 6482.7335148909, RMSE = 80.51542408067475, R2 = 0.8580003753970591 Average MSE: 25179.288356393925, Average RMSE: 133.4489969024974, Average R2: 0.7005246058034572 --------Validation set results--------- Average MSE:6054.571539000327, Average RMSE:76.0116459787697, Average R2: 0.8240671225169148
The above results logic represents a time series cross-validation process for the XGBoost model applied to the target variable total_food_service sales.' The results for each fold of the cross-validation are displayed, showing the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) metrics.
The average performance across all folds is also calculated. The average MSE is 25,179.29, the average RMSE is 133.45, and the average R2 is 0.70, indicating the model's performance across different time periods.
The 'Validation set results' section presents the average metrics for individual stores, emphasizing the model's ability to generalize to new data. In this case, the average MSE is 6,054.57, the average RMSE is 76.01, and the average R2 is 0.82. These metrics provide insights into the XGBoost model's effectiveness in forecasting the 'total_food_service' variable, with lower MSE and RMSE values and a higher R2 indicating better predictive performance.
forecast_model_cv('diesel_x')
Fold 0: MSE = 99642.33831435276, RMSE = 315.66174667569834, R2 = 0.8882343908511142 Fold 1: MSE = 388133.99555090413, RMSE = 623.004009257488, R2 = 0.8560706812303035 Fold 2: MSE = 115119.18364533328, RMSE = 339.2921803480494, R2 = 0.9030030919437348 Fold 3: MSE = 48642.343487312006, RMSE = 220.55009292066055, R2 = 0.9704720673895306 Fold 4: MSE = 77605.24925618058, RMSE = 278.5771872501059, R2 = 0.9241587913070219 Average MSE: 145828.62205081654, Average RMSE: 355.41704329040044, Average R2: 0.908387804544341 --------Validation set results--------- Average MSE:1143801.9627692136, Average RMSE:638.3643855967982, Average R2: 0.2611097952178426
The provided logic demonstrates a time series cross-validation process for the XGBoost model applied to the target variable 'diesel_x.' The results for each fold of the cross-validation include the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) metrics.
The average performance across all folds is also computed. The average MSE is 145,828.62, the average RMSE is 355.42, and the average R2 is 0.91, indicating the model's overall effectiveness in forecasting 'diesel_x' across different time periods.
However, the 'Validation set results' section suggests that the model's performance varies across individual stores, as indicated by the considerably higher average MSE of 1,143,801.96, average RMSE of 638.36, and average R2 of 0.26 for the specific stores in the validation set. These metrics provide insights into the XGBoost model's effectiveness in forecasting the 'Diesel' variable, with the lower average MSE and RMSE values and higher R2 indicating better performance on the training data compared to the validation set.
forecast_model_cv('unleaded')
Fold 0: MSE = 164971.78122909344, RMSE = 406.1671838407104, R2 = 0.8312750557862972 Fold 1: MSE = 61658.09857672998, RMSE = 248.31048825357735, R2 = 0.8956959074832139 Fold 2: MSE = 55896.10348556213, RMSE = 236.42356795709293, R2 = 0.9098274144967747 Fold 3: MSE = 131024.5457831534, RMSE = 361.9731285374004, R2 = 0.8703628951609286 Fold 4: MSE = 60384.74899318524, RMSE = 245.73308485669006, R2 = 0.9377089024060206 Average MSE: 94787.05561354483, Average RMSE: 299.72149068909425, Average R2: 0.888974035066647 --------Validation set results--------- Average MSE:123005.20152638089, Average RMSE:321.78147755855224, Average R2: 0.5954615626634397
The above results & logic demonstrates a time series cross-validation process for the XGBoost model applied to the target variable 'unleaded.' The results for each fold of the cross-validation include the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2) metrics.
The average performance across all folds is also computed. The average MSE is 94,787.06, the average RMSE is 299.72, and the average R2 is 0.89, indicating the model's overall effectiveness in forecasting 'unleaded' across different time periods.
However, the 'Validation set results' section suggests that the model's performance varies across individual stores, as indicated by the considerably higher average MSE of 123,005.20, average RMSE of 321.78, and average R2 of 0.60 for the specific stores in the validation set. These metrics provide insights into the XGBoost model's effectiveness in forecasting the 'Unleaded' variable, with the lower average MSE and RMSE values and higher R2 indicating better performance on the training data compared to the validation set.
from sklearn.model_selection import train_test_split
# Define the ratio for the test data
test_ratio = 0.2 # You can adjust this ratio as needed
# Split the data into training and test sets
train_data, test_data = train_test_split(df_tsdm, test_size=test_ratio, random_state=42)
# Save the training and test data as CSV files if needed
train_data.to_csv('training_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from prophet import Prophet
# Assuming you have a DataFrame called train_data
inside_sales_train = train_data[['calendar.calendar_day_date', 'daily_yoy_ndt.total_inside_sales']]
inside_sales_train.rename(columns={'calendar.calendar_day_date': 'ds', 'daily_yoy_ndt.total_inside_sales': 'y'}, inplace=True)
# Define and fit the Prophet model for inside sales
inside_sales_model = Prophet()
inside_sales_model.fit(inside_sales_train)
# Generate future date range for inside sales forecasting (365 days)
future_inside_sales = inside_sales_model.make_future_dataframe(periods=365)
# Make predictions for inside sales
forecast_inside_sales = inside_sales_model.predict(future_inside_sales)
# Access forecasted values for inside sales
forecasted_values_inside_sales = forecast_inside_sales[['ds', 'yhat']]
# Visualize the forecast and components for inside sales
fig_inside_sales = inside_sales_model.plot(forecast_inside_sales)
fig_inside_sales_components = inside_sales_model.plot_components(forecast_inside_sales)
# Set titles for the plots
fig_inside_sales.suptitle("Inside Sales Actual + Forecast for next 12 months")
fig_inside_sales_components.suptitle("Inside Sales Forecast Components")
# Assuming you have your actual values (ground truth) for the validation period
actual_values = test_data['daily_yoy_ndt.total_inside_sales'].tail(365).values
# Extract forecasted values for the validation period
forecasted_values = forecasted_values_inside_sales.tail(365)['yhat'].values
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actual_values, forecasted_values)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_values, forecasted_values)
# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
# Calculate R-squared (R2)
r2 = r2_score(actual_values, forecasted_values)
print(f"-----------------------Features---------------------------")
print(f'Mean Absolute Error- MAE: {mae}')
print(f'Mean Squared Error- MSE: {mse}')
print(f'Root Mean Square Error- RMSE: {rmse}')
print(f'R2:R-squared{r2}')
<ipython-input-102-9738787503ad>:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
inside_sales_train.rename(columns={'calendar.calendar_day_date': 'ds', 'daily_yoy_ndt.total_inside_sales': 'y'}, inplace=True)
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/kr0p04wz.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/g6x0y1g4.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=38787', 'data', 'file=/tmp/tmpcw905kue/kr0p04wz.json', 'init=/tmp/tmpcw905kue/g6x0y1g4.json', 'output', 'file=/tmp/tmpcw905kue/prophet_model7bh8affo/prophet_model-20231111232135.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
23:21:35 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
23:21:36 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
-----------------------Features--------------------------- Mean Absolute Error- MAE: 909.1675562898364 Mean Squared Error- MSE: 1277009.5908604828 Root Mean Square Error- RMSE: 1130.0484904907767 R2:R-squared-0.2975971403214599
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from prophet import Prophet
# Assuming you have a DataFrame called train_data
food_service_train = train_data[['calendar.calendar_day_date', 'daily_yoy_ndt.total_food_service']]
food_service_train.rename(columns={'calendar.calendar_day_date': 'ds', 'daily_yoy_ndt.total_food_service': 'y'}, inplace=True)
# Define and fit the Prophet model for food service sales
food_service_model = Prophet()
food_service_model.fit(food_service_train)
# Generate future date range for food service sales forecasting (365 days)
future_food_service = food_service_model.make_future_dataframe(periods=365)
# Make predictions for food service sales
forecast_food_service = food_service_model.predict(future_food_service)
# Access forecasted values for food service sales
forecasted_values_food_service = forecast_food_service[['ds', 'yhat']]
# Visualize the forecast and components for food service sales
fig_food_service = food_service_model.plot(forecast_food_service)
fig_food_service_components = food_service_model.plot_components(forecast_food_service)
# Set titles for the plots
fig_food_service.suptitle("Food Service Sales Actual + Forecast for next 12 months")
fig_food_service_components.suptitle("Food Service Sales Forecast Components")
# Assuming you have your actual values (ground truth) for the validation period
actual_values = test_data['daily_yoy_ndt.total_food_service'].tail(365).values
# Extract forecasted values for the validation period
forecasted_values = forecasted_values_food_service.tail(365)['yhat'].values
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actual_values, forecasted_values)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_values, forecasted_values)
# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
# Calculate R-squared (R2)
r2 = r2_score(actual_values, forecasted_values)
print(f"-----------------------Features---------------------------")
print(f'Mean Absolute Error- MAE: {mae}')
print(f'Mean Squared Error- MSE: {mse}')
print(f'Root Mean Square Error- RMSE: {rmse}')
print(f'R2:R-squared{r2}')
<ipython-input-103-c4403c841e4b>:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
food_service_train.rename(columns={'calendar.calendar_day_date': 'ds', 'daily_yoy_ndt.total_food_service': 'y'}, inplace=True)
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/2s55jps9.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/gxoly90h.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=94984', 'data', 'file=/tmp/tmpcw905kue/2s55jps9.json', 'init=/tmp/tmpcw905kue/gxoly90h.json', 'output', 'file=/tmp/tmpcw905kue/prophet_modelr9sm3ily/prophet_model-20231111232141.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
23:21:41 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
23:21:43 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
-----------------------Features--------------------------- Mean Absolute Error- MAE: 325.3608875609529 Mean Squared Error- MSE: 161153.13459088575 Root Mean Square Error- RMSE: 401.43883044728716 R2:R-squared-0.49217386879986
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from prophet import Prophet
# Assuming we have a DataFrame called train_data
diesel_train = train_data[['calendar.calendar_day_date', 'diesel']]
diesel_train.rename(columns={'calendar.calendar_day_date': 'ds', 'diesel': 'y'}, inplace=True)
# Define and fit the Prophet model for diesel sales
diesel_model = Prophet()
diesel_model.fit(diesel_train)
# Generate future date range for diesel sales forecasting (365 days)
future_diesel = diesel_model.make_future_dataframe(periods=365)
# Make predictions for diesel sales
forecast_diesel = diesel_model.predict(future_diesel)
# Access forecasted values for diesel sales
forecasted_values_diesel = forecast_diesel[['ds', 'yhat']]
# Visualize the forecast and components for diesel sales
fig_diesel = diesel_model.plot(forecast_diesel)
fig_diesel_components = diesel_model.plot_components(forecast_diesel)
# Set titles for the plots
fig_diesel.suptitle("Diesel Sales Actual + Forecast for next 12 months")
fig_diesel_components.suptitle("Diesel Sales Forecast Components")
# Assuming we have your actual values (ground truth) for the validation period
actual_values = test_data['diesel'].tail(365).values
# Extract forecasted values for the validation period
forecasted_values = forecasted_values_diesel.tail(365)['yhat'].values
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actual_values, forecasted_values)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_values, forecasted_values)
# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
# Calculate R-squared (R2)
r2 = r2_score(actual_values, forecasted_values)
print(f"-----------------------Features---------------------------")
print(f'Mean Absolute Error- MAE: {mae}')
print(f'Mean Squared Error- MSE: {mse}')
print(f'Root Mean Square Error- RMSE: {rmse}')
print(f'R2:R-squared{r2}')
<ipython-input-104-68463db684b8>:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
diesel_train.rename(columns={'calendar.calendar_day_date': 'ds', 'diesel': 'y'}, inplace=True)
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/koysj68s.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/bzwdsq9m.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=34478', 'data', 'file=/tmp/tmpcw905kue/koysj68s.json', 'init=/tmp/tmpcw905kue/bzwdsq9m.json', 'output', 'file=/tmp/tmpcw905kue/prophet_model9n_81_55/prophet_model-20231111232146.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
23:21:46 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
23:21:46 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
-----------------------Features--------------------------- Mean Absolute Error- MAE: 1268.3773544811488 Mean Squared Error- MSE: 3139399.9493940254 Root Mean Square Error- RMSE: 1771.835192503531 R2:R-squared-0.06722174488757537
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from prophet import Prophet
# Assuming you have a DataFrame called train_data
unleaded_train = train_data[['calendar.calendar_day_date', 'unleaded']]
unleaded_train.rename(columns={'calendar.calendar_day_date': 'ds', 'unleaded': 'y'}, inplace=True)
# Define and fit the Prophet model for unleaded sales
unleaded_model = Prophet()
unleaded_model.fit(unleaded_train)
# Generate future date range for unleaded sales forecasting (365 days)
future_unleaded = unleaded_model.make_future_dataframe(periods=365)
# Make predictions for unleaded sales
forecast_unleaded = unleaded_model.predict(future_unleaded)
# Access forecasted values for unleaded sales
forecasted_values_unleaded = forecast_unleaded[['ds', 'yhat']]
# Visualize the forecast and components for unleaded sales
fig_unleaded = unleaded_model.plot(forecast_unleaded)
fig_unleaded_components = unleaded_model.plot_components(forecast_unleaded)
# Set titles for the plots
fig_unleaded.suptitle("Unleaded Sales Actual + Forecast for next 12 months")
fig_unleaded_components.suptitle("Unleaded Sales Forecast Components")
# Assuming you have your actual values (ground truth) for the validation period
actual_values = test_data['unleaded'].tail(365).values
# Extract forecasted values for the validation period
forecasted_values = forecasted_values_unleaded.tail(365)['yhat'].values
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(actual_values, forecasted_values)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_values, forecasted_values)
# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)
# Calculate R-squared (R2)
r2 = r2_score(actual_values, forecasted_values)
print(f"-----------------------Features---------------------------")
print(f'Mean Absolute Error- MAE: {mae}')
print(f'Mean Squared Error- MSE: {mse}')
print(f'Root Mean Square Error- RMSE: {rmse}')
print(f'R2:R-squared{r2}')
<ipython-input-105-60cd7a72131b>:6: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
unleaded_train.rename(columns={'calendar.calendar_day_date': 'ds', 'unleaded': 'y'}, inplace=True)
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/ewo1qi1w.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpcw905kue/d05oa08l.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=5696', 'data', 'file=/tmp/tmpcw905kue/ewo1qi1w.json', 'init=/tmp/tmpcw905kue/d05oa08l.json', 'output', 'file=/tmp/tmpcw905kue/prophet_modelmx3v81kx/prophet_model-20231111232149.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
23:21:49 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
23:21:50 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
-----------------------Features--------------------------- Mean Absolute Error- MAE: 793.8829961059837 Mean Squared Error- MSE: 1090062.143175371 Root Mean Square Error- RMSE: 1044.060411650289 R2:R-squared-0.0944194610733573
from tabulate import tabulate
data = [
['daily_yoy_ndt.total_inside_sales', '911.23','1280792.08', '1131.72'],
['daily_yoy_ndt.total_food_service', '325.36', '161153.13', '401.43'],
['Diesel', '1268.37', '3139399.94', '1771.83',],
['Unleaded', '793.88', '1090062.14', '1044.06']
]
title = "Prophet Best Model Results"
print("\n" + title)
print(tabulate(data, headers=['Output Variables','MAE', 'MSE', 'RMSE'], tablefmt='fancy_grid'))
Prophet Best Model Results ╒══════════════════════════════════╤═════════╤══════════════════╤═════════╕ │ Output Variables │ MAE │ MSE │ RMSE │ ╞══════════════════════════════════╪═════════╪══════════════════╪═════════╡ │ daily_yoy_ndt.total_inside_sales │ 911.23 │ 1.28079e+06 │ 1131.72 │ ├──────────────────────────────────┼─────────┼──────────────────┼─────────┤ │ daily_yoy_ndt.total_food_service │ 325.36 │ 161153 │ 401.43 │ ├──────────────────────────────────┼─────────┼──────────────────┼─────────┤ │ Diesel │ 1268.37 │ 3.1394e+06 │ 1771.83 │ ├──────────────────────────────────┼─────────┼──────────────────┼─────────┤ │ Unleaded │ 793.88 │ 1.09006e+06 │ 1044.06 │ ╘══════════════════════════════════╧═════════╧══════════════════╧═════════╛
One of the models we tried was Linear Regresion.This model was used because it can be used to forecast the dependent variable's future values based on the independent variable's values. Also, it can be really useful when trying to predict sales. The initial Linear Regression model that was made used all of the independent variables. Then Ridge regression was used to identify features with high VIF values to then remove them from the model. Therefore, the second model created only included the food variables identified from the Ridge Regression. The second model yielded the best results and its results can be seen below.
from tabulate import tabulate
data = [
['daily_yoy_ndt.total_inside_sales', '308684.84', '555.59','0.66'],
['daily_yoy_ndt.total_food_service', '55279.02', '235.11', '0.48'],
['Diesel', '19636308.01', '4431.29', '-0.34'],
['Unleaded', '761533.91', '872.66', '-0.04']
]
title = "Linear Regression Best Model Results"
print("\n" + title)
print(tabulate(data, headers=['Output Variables', 'MSE', 'RMSE', 'R-Squared'], tablefmt='fancy_grid'))
Linear Regression Best Model Results ╒══════════════════════════════════╤══════════════════╤═════════╤═════════════╕ │ Output Variables │ MSE │ RMSE │ R-Squared │ ╞══════════════════════════════════╪══════════════════╪═════════╪═════════════╡ │ daily_yoy_ndt.total_inside_sales │ 308685 │ 555.59 │ 0.66 │ ├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤ │ daily_yoy_ndt.total_food_service │ 55279 │ 235.11 │ 0.48 │ ├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤ │ Diesel │ 1.96363e+07 │ 4431.29 │ -0.34 │ ├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤ │ Unleaded │ 761534 │ 872.66 │ -0.04 │ ╘══════════════════════════════════╧══════════════════╧═════════╧═════════════╛
One of the models we tried was Random Forest. Random Forest is a good model to try because it is an ensemble of decision trees which leads to it having lower variance than other machine learning algorithms and it can produce better results. The model can also improve time series predictions by resolving biases in the model that might exist. The initial Random Forest model created was a regular Random Forest model and had n_estimators=100 and random_state=40. The second model created used hyperparameter tuning to try and yield better results. The second model did yield better results and its results can be seen below.
from tabulate import tabulate
data = [
['daily_yoy_ndt.total_inside_sales', '621285.06', '788.22','0.26'],
['daily_yoy_ndt.total_food_service', '47230.02', '217.32', '0.57'],
['Diesel', '1505394.43', '1226.95', '0.33'],
['Unleaded', '3582945.21', '1892.87', '-0.78']
]
title = "Random Forest Best Model Results"
print("\n" + title)
print(tabulate(data, headers=['Output Variables', 'MSE', 'RMSE', 'R-Squared'], tablefmt='fancy_grid'))
Random Forest Best Model Results ╒══════════════════════════════════╤══════════════════╤═════════╤═════════════╕ │ Output Variables │ MSE │ RMSE │ R-Squared │ ╞══════════════════════════════════╪══════════════════╪═════════╪═════════════╡ │ daily_yoy_ndt.total_inside_sales │ 621285 │ 788.22 │ 0.26 │ ├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤ │ daily_yoy_ndt.total_food_service │ 47230 │ 217.32 │ 0.57 │ ├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤ │ Diesel │ 1.50539e+06 │ 1226.95 │ 0.33 │ ├──────────────────────────────────┼──────────────────┼─────────┼─────────────┤ │ Unleaded │ 3.58295e+06 │ 1892.87 │ -0.78 │ ╘══════════════════════════════════╧══════════════════╧═════════╧═════════════╛
SARIMA (Seasonal Autoregressive Integrated Moving Average) models are particularly useful for time series forecasting tasks, especially when the data exhibit a clear seasonality pattern. It also captures both the trend and seasonal patterns in the data and capable of modeling the seasonal component, which is essential for forecasting tasks where the data exhibits regular patterns over time.
A grid search is performed using auto_arima to find the best SARIMA parameters for the training data. The best order and seasonal order obtained from the grid search are used to fit a SARIMA model. Predictions are made on the testing data, RMSE (Root Mean Square Error), MSE (Mean Square Error), and R-Squared, are calculated to evaluate the model's performance for each target variable.
from tabulate import tabulate
data = [
['daily_yoy_ndt.total_inside_sales', '1514448.86', '1230.63'],
['daily_yoy_ndt.total_food_service', '180738.84', '425.13'],
['Diesel', '16674629.69', '4083.46'],
['Unleaded', '863498.96', '929.25']
]
title = "SARIMA Best Model Results"
print("\n" + title)
print(tabulate(data, headers=['Output Variables', 'MSE', 'RMSE', 'R-Squared'], tablefmt='fancy_grid'))
SARIMA Best Model Results ╒══════════════════════════════════╤══════════════════╤═════════╕ │ Output Variables │ MSE │ RMSE │ ╞══════════════════════════════════╪══════════════════╪═════════╡ │ daily_yoy_ndt.total_inside_sales │ 1.51445e+06 │ 1230.63 │ ├──────────────────────────────────┼──────────────────┼─────────┤ │ daily_yoy_ndt.total_food_service │ 180739 │ 425.13 │ ├──────────────────────────────────┼──────────────────┼─────────┤ │ Diesel │ 1.66746e+07 │ 4083.46 │ ├──────────────────────────────────┼──────────────────┼─────────┤ │ Unleaded │ 863499 │ 929.25 │ ╘══════════════════════════════════╧══════════════════╧═════════╛
Across the different forecasting models for daily year-over-year inside sales, food service sales, diesel, and unleaded fuel sales, the models demonstrate varying degrees of predictive accuracy, with R2 scores ranging from moderate to good. The models for inside sales and food service sales show relatively high R2 values, indicating that they are able to explain a substantial portion of the variance in the respective sales figures. The unleaded sales model also reflects a good fit, with its R2 score illustrating that it captures the variance well across the stores. However, the diesel sales model exhibits a lower average R2, suggesting a weaker predictive capability.
The RMSE values across all models show that there is room for improvement in the prediction accuracy, with particular attention needed for specific stores where the predictions were less accurate. Notably, the RMSE for diesel sales is significantly higher than for other categories, indicating larger discrepancies between the predicted and actual sales.
The wide range of MSE and RMSE values across stores for each model indicates inconsistencies in model performance, which could be due to store-specific factors not accounted for in the models.
Implementing the XGBoost Regressor to forecast the target features by predicting test stores individually as well as applying cross validations we observe that the Average RMSE and R2 score in both cases are almost similar.
a. daily_yoy_ndt.total_inside_sales:
Average MSE: 260221.35414967436, Average RMSE: 443.231123312014, Average R2: 0.67289066188858668
b. daily_yoy_ndt.total_food_service:
Average MSE: 25179.288356393925, Average RMSE: 133.4489969024974, Average R2: 0.7005246058034572
c. diesel: Average MSE: 145828.62205081654, Average RMSE: 355.41704329040044, Average R2: 0.908387804544341
d. unleaded: Average MSE: 94787.05561354483, Average RMSE: 299.72149068909425, Average R2: 0.888974035066647
In conclusion, while the models are generally effective at forecasting sales across different categories, the results highlight the possibility of overfitting and the need for further refinement. This could include incorporating additional location-specific variables, adjusting model parameters, or exploring different modeling techniques to improve predictive accuracy and consistency across all stores. The disparities in model performance suggest that a one-size-fits-all model may not be optimal and that a more customized approach may be necessary to address the nuances of each store's sales dynamics.
I created the SARIMA model and used auto grid search to get the best result on this model.
I created the Linear Regression and Random Forest models and provided their descriptions and results.
I performed descriptive and statistical analysis on the time-series and qualitative data. I treated missing values in the categorical data by creating new 'None' category . Moreover, I detected and treated the outliers for each target variable by using the IQR Method. I even performed the stationarity test for each target variable and created Autocorrelation plots for analysis. I performed feature engineering by creating new date features and also created lag and rolling features. Further, I encoded the categorical variables using One-Hot Encoding and split the data into train and test based on store ids. Finally, I trained, tested and performed forecasting using the XGBRegressor.
I created ARIMA, Prophet Models. Presented the results of Prophet Model along with visualizations, provided the write up for Introduction, summary of the visualizations and the logics.